15/10/2020
Interpreting 0s and 1s as text…
cat helloworld.txt; echo
## Hello World!
Directly looking at the 0s and 1s…
xxd -b helloworld.txt
## 00000000: 01001000 01100101 01101100 01101100 01101111 00100000  Hello 
## 00000006: 01010111 01101111 01110010 01101100 01100100 00100001  World!
01001000 is H?
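The question above can be checked directly in R. The following is a minimal sketch using base R only; the helper names `bits_to_char` and `char_to_bits` are invented for this illustration and are not part of the original material. Each 8-bit pattern is read as a base-2 integer, and the character with that code is looked up.

```r
# Hypothetical helper: interpret 8-bit strings as characters
bits_to_char <- function(bits) {
  codes <- strtoi(bits, base = 2)    # e.g. "01001000" is read as a base-2 number
  intToUtf8(codes, multiple = TRUE)  # look up the character for each code
}

# Hypothetical helper: the reverse direction, one ASCII character to an 8-bit string
char_to_bits <- function(ch) {
  code <- utf8ToInt(ch)
  # intToBits() returns 32 bits, least significant first; keep 8 and reverse them
  paste(rev(as.integer(intToBits(code))[1:8]), collapse = "")
}

# The first five bytes shown by xxd above
hello_bits <- c("01001000", "01100101", "01101100", "01101100", "01101111")
paste(bits_to_char(hello_bits), collapse = "")
char_to_bits("H")
```

The mapping only works because reader and writer agree on the same table of codes, which is what a character encoding defines.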
cat hastamanana.txt; echo
## Hasta Ma?ana!
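The `?` in the output indicates that the byte(s) standing for `ñ` could not be interpreted. A plausible explanation, assumed here and not stated in the original, is that the file was written in one encoding (for example Latin-1) and read in another (for example UTF-8). The base R sketch below shows that the same character corresponds to different bytes depending on the encoding.

```r
# The intended text, written with a Unicode escape so the script itself is pure ASCII
x <- "Hasta Ma\u00f1ana!"

# Bytes when the string is encoded as UTF-8: the n-tilde takes two bytes
utf8_bytes <- charToRaw(enc2utf8(x))

# Bytes when the same string is encoded as Latin-1: the n-tilde takes one byte
latin1_bytes <- charToRaw(iconv(x, from = "UTF-8", to = "latin1"))

utf8_bytes
latin1_bytes
```

A program that expects one of these byte sequences but receives the other has no valid character to display at that position.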
Bit, Byte, Word. Figure by Murrell (2009) (licensed under CC BY-NC-SA 3.0 NZ)
We distinguish two basic characteristics:
father  mother  name      age  gender
                John       33  male
                Julia      32  female
John    Julia   Jack        6  male
John    Julia   Jill        4  female
John    Julia   John jnr    2  male
                David      45  male
                Debbie     42  female
David   Debbie  Donald     16  male
David   Debbie  Dianne     12  female
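To illustrate how such a table can be handled in practice, the sketch below stores the same records as comma-separated values and parses them in R. The CSV layout, with empty fields for parents that are not recorded, is an assumption made for this example.

```r
# The family table as CSV text; unrecorded parents are left as empty fields
family_csv <- "father,mother,name,age,gender
,,John,33,male
,,Julia,32,female
John,Julia,Jack,6,male
John,Julia,Jill,4,female
John,Julia,John jnr,2,male
,,David,45,male
,,Debbie,42,female
David,Debbie,Donald,16,male
David,Debbie,Dianne,12,female"

# Parse the text into a data frame; keep strings as plain character vectors
family <- read.csv(text = family_csv, stringsAsFactors = FALSE)

# Inspect the column types chosen by the parser
str(family)

# Names of the persons without a recorded father
family[family$father == "", "name"]
```

The delimiter makes the empty fields explicit, so the parser does not have to rely on column positions to tell which values are missing.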
VARIABLE : Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)
FILENAME : ISCCPMonthly_avg.nc
FILEPATH : /usr/local/fer_data/data/
BAD FLAG : -1.E+34
SUBSET : 48 points (TIME)
LONGITUDE: 123.8W(-123.8)
LATITUDE : 48.8S
123.8W
16-JAN-1994 00 9.200012
16-FEB-1994 00 10.70001
16-MAR-1994 00 7.5
16-APR-1994 00 8.100006
<?xml version="1.0"?>
<temperatures>
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
...
</temperatures>
The content we already know from the CSV-type example above is nested between the ‘temperatures’ tags:
<temperatures> ... </temperatures>
Comparing the actual content between these tags with the csv-type format above, we further recognize that there are two principal ways to link variable names to values.
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
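1. Tag-based: define an opening and a closing tag named after the variable and place the value between them, as in <filename>ISCCPMonthly_avg.nc</filename>.
2. Attribute-based: store the values as attributes within a single tag, as in <case date="16-JAN-1994" temperature="9.200012" />.
Attributes-based: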
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
Tag-based:
<cases>
<case>
<date>16-JAN-1994</date>
<temperature>9.200012</temperature>
</case>
<case>
<date>16-FEB-1994</date>
<temperature>10.70001</temperature>
</case>
<case>
<date>16-MAR-1994</date>
<temperature>7.5</temperature>
</case>
<case>
<date>16-APR-1994</date>
<temperature>8.100006</temperature>
</case>
</cases>
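Both encodings carry the same information, which can be verified by parsing them. The sketch below uses the `xml2` package (also used later in these notes) on two short inline documents; the object names are chosen for this illustration only. Attribute values are read with `xml_attr()`, while tag-based values are the text content of child elements.

```r
library(xml2)

# Attribute-based encoding of two cases
attr_xml <- read_xml('<cases>
  <case date="16-JAN-1994" temperature="9.200012" />
  <case date="16-FEB-1994" temperature="10.70001" />
</cases>')

# Tag-based encoding of the same two cases
tag_xml <- read_xml('<cases>
  <case><date>16-JAN-1994</date><temperature>9.200012</temperature></case>
  <case><date>16-FEB-1994</date><temperature>10.70001</temperature></case>
</cases>')

# Attribute-based: read the values from the attributes of each case node
attr_cases <- xml_find_all(attr_xml, "//case")
attr_df <- data.frame(
  date = xml_attr(attr_cases, "date"),
  temperature = as.numeric(xml_attr(attr_cases, "temperature")),
  stringsAsFactors = FALSE
)

# Tag-based: read the values from the text of the child elements
tag_df <- data.frame(
  date = xml_text(xml_find_all(tag_xml, "//case/date")),
  temperature = as.numeric(xml_text(xml_find_all(tag_xml, "//case/temperature"))),
  stringsAsFactors = FALSE
)

# Check whether both routes lead to the same table
identical(attr_df, tag_df)
```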
Note the key differences of storing data in XML format in contrast to a flat, table-like format such as CSV:
Potential drawback of XML: inefficient storage.
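The storage overhead can be made concrete with a rough comparison. The sketch below, a simple illustration rather than a benchmark, writes the four observations from the example once as CSV text and once as attribute-based XML and counts the bytes of each. The CSV column names `date` and `temperature` are assumptions for this example.

```r
# Four observations, values taken from the example above
dates <- c("16-JAN-1994", "16-FEB-1994", "16-MAR-1994", "16-APR-1994")
temps <- c("9.200012", "10.70001", "7.5", "8.100006")

# CSV: one header line, then one line per observation
csv_repr <- paste(c("date,temperature", paste(dates, temps, sep = ",")),
                  collapse = "\n")

# XML (attribute-based): every observation repeats the variable names
xml_repr <- paste(c("<cases>",
                    sprintf('<case date="%s" temperature="%s" />', dates, temps),
                    "</cases>"),
                  collapse = "\n")

# Size of each representation in bytes
csv_size <- nchar(csv_repr, type = "bytes")
xml_size <- nchar(xml_repr, type = "bytes")
c(csv = csv_size, xml = xml_size, ratio = xml_size / csv_size)
```

In CSV the variable names appear once in the header, whereas in XML they are repeated for every observation, so the gap grows with the number of records.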
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>
JSON:
{"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumber": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
],
"gender": {
"type": "male"
}
}
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
</person>
JSON:
{"firstName": "John",
"lastName": "Smith"
}
The following examples are based on the example code shown above (stored in the two text files persons.json and persons.xml).
# load packages
library(xml2)
# parse XML, represent XML document as R object
xml_doc <- read_xml("persons.xml")
xml_doc
## {xml_document}
## <person>
## [1] <firstName>John</firstName>
## [2] <lastName>Smith</lastName>
## [3] <age>25</age>
## [4] <address>\n <streetAddress>21 2nd Street</streetAddress>\n <city>New York</city>\n <state> ...
## [5] <phoneNumber>\n <type>home</type>\n <number>212 555-1234</number>\n</phoneNumber>
## [6] <phoneNumber>\n <type>fax</type>\n <number>646 555-4567</number>\n</phoneNumber>
## [7] <gender>\n <type>male</type>\n</gender>
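Once the document is parsed, individual values can be selected with XPath expressions. The sketch below works on a shortened inline copy of the person record so that it runs without the file persons.xml; the object names are chosen for this illustration only.

```r
library(xml2)

# A shortened copy of the person record from above
person_doc <- read_xml("<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <phoneNumber><type>home</type><number>212 555-1234</number></phoneNumber>
  <phoneNumber><type>fax</type><number>646 555-4567</number></phoneNumber>
</person>")

# Names of the elements directly below the root
xml_name(xml_children(person_doc))

# Text of a single element, selected with an XPath expression
first_name <- xml_text(xml_find_first(person_doc, "//firstName"))

# All phone numbers, collected in a data frame (one row per phoneNumber node)
phones <- xml_find_all(person_doc, "//phoneNumber")
phone_df <- data.frame(
  type = xml_text(xml_find_first(phones, "./type")),
  number = xml_text(xml_find_first(phones, "./number")),
  stringsAsFactors = FALSE
)
phone_df
```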
# load packages
library(jsonlite)
# parse the JSON-document shown in the example above
json_doc <- fromJSON("persons.json")
# check the structure
str(json_doc)
## List of 6
##  $ firstName  : chr "John"
##  $ lastName   : chr "Smith"
##  $ age        : int 25
##  $ address    :List of 4
##   ..$ streetAddress: chr "21 2nd Street"
##   ..$ city         : chr "New York"
##   ..$ state        : chr "NY"
##   ..$ postalCode   : chr "10021"
##  $ phoneNumber:'data.frame': 2 obs. of  2 variables:
##   ..$ type  : chr [1:2] "home" "fax"
##   ..$ number: chr [1:2] "212 555-1234" "646 555-4567"
##  $ gender     :List of 1
##   ..$ type: chr "male"
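The structure shown by `str()` tells us how to reach individual values: nested JSON objects become nested lists, and an array of similar objects becomes a data frame. The sketch below uses a shortened inline copy of the record so that it runs without persons.json; the object names are chosen for this illustration only.

```r
library(jsonlite)

# A shortened copy of the person record from above
person_json <- '{
  "firstName": "John",
  "age": 25,
  "address": {"city": "New York", "state": "NY"},
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax", "number": "646 555-4567"}
  ]
}'
person <- fromJSON(person_json)

# Nested objects are accessed like nested lists
city <- person$address$city

# The array of phone numbers is a data frame and can be filtered as such
fax_number <- person$phoneNumber$number[person$phoneNumber$type == "fax"]

# Convert part of the R object back to JSON text
toJSON(person$address, auto_unbox = TRUE)
```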
HyperText Markup Language (HTML), designed to be read by a web browser.
HTML documents/webpages consist of ‘semi-structured data’:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<h2> hello, world </h2>
</body>
</html>
head and body are nested within the html document.
In head, we define the title, etc.
<html>..</html>
<head>...</head>, <body>...</body>
HTML (DOM) tree diagram (by Lubaochuan 2014, licensed under the Creative Commons Attribution-Share Alike 4.0 International license).
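The nesting described here can be inspected programmatically. The sketch below parses the small example page with `read_html()` from the `xml2` package and walks down the tree; the object names are chosen for this illustration only.

```r
library(xml2)

# The example page from above as a string
hello_html <- "<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<h2> hello, world </h2>
</body>
</html>"

hello_doc <- read_html(hello_html)

# The root element and the elements nested directly inside it
root_name <- xml_name(hello_doc)
child_names <- xml_name(xml_children(hello_doc))

# The title is nested inside head, which is nested inside html
page_title <- xml_text(xml_find_first(hello_doc, "/html/head/title"))

root_name
child_names
page_title
```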
In this example, we look at Wikipedia’s Economy of Switzerland page.
swiss_econ <- readLines("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
## Warning in readLines("https://en.wikipedia.org/wiki/Economy_of_Switzerland"): incomplete final line
## found on 'https://en.wikipedia.org/wiki/Economy_of_Switzerland'
head(swiss_econ)
## [1] "<!DOCTYPE html>"
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"
## [3] "<head>"
## [4] "<meta charset=\"UTF-8\"/>"
## [5] "<title>Economy of Switzerland - Wikipedia</title>"
## [6] "<script>document.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":!1,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],\"wgRequestId\":\"5175a5f4-c341-459c-971c-84e51d7d22e4\",\"wgCSPNonce\":!1,\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":!1,\"wgNamespaceNumber\":0,\"wgPageName\":\"Economy_of_Switzerland\",\"wgTitle\":\"Economy of Switzerland\",\"wgCurRevisionId\":981856120,\"wgRevisionId\":981856120,\"wgArticleId\":27465,\"wgIsArticle\":!0,\"wgIsRedirect\":!1,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"CS1 maint: archived copy as title\",\"CS1 German-language sources (de)\",\"Articles with German-language sources (de)\",\"Webarchive template wayback links\",\"Articles with French-language sources (fr)\",\"Wikipedia articles needing clarification from January 2019\","
Search for specific content
line_number <- grep('US Dollar Exchange', swiss_econ)
line_number
## [1] 216
swiss_econ[line_number]
## [1] "<th>US Dollar Exchange"
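Working on the raw lines means that the tags have to be removed by hand. The base R sketch below shows one simple way to do this with a regular expression; the vector `html_lines` is a small stand-in for the downloaded page, not its actual content, and the pattern is only adequate for simple cases like this one.

```r
# A few lines of raw HTML, standing in for the downloaded page
html_lines <- c("<table>",
                "<tr>",
                "<th>Year</th>",
                "<th>US Dollar Exchange</th>",
                "</tr>",
                "</table>")

# Find the line that contains the search term (literal match, no regex)
hit <- grep("US Dollar Exchange", html_lines, fixed = TRUE)

# Remove everything that looks like a tag, keeping only the text
clean_text <- trimws(gsub("<[^>]+>", "", html_lines[hit]))

hit
clean_text
```

This approach quickly becomes fragile for real pages, which is one reason to parse the HTML into a tree instead of treating it as plain text.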
# install package if not yet installed
# install.packages("rvest")
# load the package
library(rvest)
# parse the webpage, show the content
swiss_econ_parsed <- read_html("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
swiss_econ_parsed
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="U ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Eco ...
Now we can easily separate the data/text from the HTML code. For example, we can extract the HTML table containing the data we are interested in as a data.frame.
tab_node <- html_node(swiss_econ_parsed,
xpath = "//*[@id='mw-content-text']/div/table[2]")
tab <- html_table(tab_node)
tab
##    Year GDP (billions of CHF) US Dollar Exchange
## 1  1980                   184        1.67 Francs
## 2  1985                   244        2.43 Francs
## 3  1990                   331        1.38 Francs
## 4  1995                   374        1.18 Francs
## 5  2000                   422        1.68 Francs
## 6  2005                   464        1.24 Francs
## 7  2006                   491        1.25 Francs
## 8  2007                   521        1.20 Francs
## 9  2008                   547        1.08 Francs
## 10 2009                   535        1.09 Francs
## 11 2010                   546        1.04 Francs
## 12 2011                   659        0.89 Francs
## 13 2012                   632        0.94 Francs
## 14 2013                   635        0.93 Francs
## 15 2014                   644        0.92 Francs
## 16 2015                   646        0.96 Francs
## 17 2016                   659        0.98 Francs
## 18 2017                   668        1.01 Francs
## 19 2018                   694        1.00 Francs
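The exchange rate column is stored as text because each value carries the unit "Francs". A possible next step, sketched below in base R, is to strip the unit and convert the column to numbers. The data frame `tab_head` re-types the first three rows of the output so that the sketch runs offline; its shortened column names are assumptions for this example.

```r
# First rows of the extracted table, re-typed here so the sketch runs offline
tab_head <- data.frame(
  year = c(1980L, 1985L, 1990L),
  gdp_chf_bn = c(184, 244, 331),
  usd_exchange = c("1.67 Francs", "2.43 Francs", "1.38 Francs"),
  stringsAsFactors = FALSE
)

# Strip the unit and convert the remaining text to a number
tab_head$usd_exchange <- as.numeric(sub(" Francs", "", tab_head$usd_exchange,
                                        fixed = TRUE))

# The column can now be used in computations
str(tab_head)
mean(tab_head$usd_exchange)
```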
“One way to answer this question is to consider the sum total of data held by all the big online storage and service companies like Google, Amazon, Microsoft and Facebook. Estimates are that the big four store at least 1,200 petabytes between them. That is 1.2 million terabytes (one terabyte is 1,000 gigabytes).” (Gareth Mitchell, ScienceFocus)
Murrell, Paul. 2009. Introduction to Data Technologies. London, UK: CRC Press.